Transformer-based models have been widely demonstrated to be successful in computer vision tasks by modelling long-range dependencies and capturing global representations. However, they are often dominated by features of large patterns leading to the loss of local details (e.g., boundaries and small objects), which are critical in medical image segmentation. To alleviate this problem, we propose a Dual-Aggregation Transformer Network called DuAT, which is characterized by two innovative designs, namely, the Global-to-Local Spatial Aggregation (GLSA) and Selective Boundary Aggregation (SBA) modules. The GLSA has the ability to aggregate and represent both global and local spatial features, which are beneficial for locating large and small objects, respectively. The SBA module is used to aggregate the boundary characteristic from low-level features and semantic information from high-level features for better preserving boundary details and locating the re-calibration objects. Extensive experiments in six benchmark datasets demonstrate that our proposed model outperforms state-of-the-art methods in the segmentation of skin lesion images, and polyps in colonoscopy images. In addition, our approach is more robust than existing methods in various challenging situations such as small object segmentation and ambiguous object boundaries.
translated by 谷歌翻译
Recently, flow-based frame interpolation methods have achieved great success by first modeling optical flow between target and input frames, and then building synthesis network for target frame generation. However, above cascaded architecture can lead to large model size and inference delay, hindering them from mobile and real-time applications. To solve this problem, we propose a novel Progressive Motion Context Refine Network (PMCRNet) to predict motion fields and image context jointly for higher efficiency. Different from others that directly synthesize target frame from deep feature, we explore to simplify frame interpolation task by borrowing existing texture from adjacent input frames, which means that decoder in each pyramid level of our PMCRNet only needs to update easier intermediate optical flow, occlusion merge mask and image residual. Moreover, we introduce a new annealed multi-scale reconstruction loss to better guide the learning process of this efficient PMCRNet. Experiments on multiple benchmarks show that proposed approaches not only achieve favorable quantitative and qualitative results but also reduces current model size and running time significantly.
translated by 谷歌翻译
视频框架插值是一项经典且具有挑战性的低级计算机视觉任务。最近,基于深度学习的方法取得了令人印象深刻的结果,并且已证明基于光流的方法可以合成具有更高质量的帧。但是,大多数基于流动的方法都假设两个输入帧之间具有恒定速度的线轨迹。只有一点点工作可以使用曲线轨迹执行预测,但这需要两个以上的框架作为输入来估计加速度,这需要更多的时间和内存才能执行。为了解决这个问题,我们提出了一个基于ARC轨迹的模型(ATCA),该模型仅从连续两个帧中就可以在前学习运动,而且轻量级。实验表明,我们的方法的性能要比许多参数较少且推理速度更快的SOTA方法更好。
translated by 谷歌翻译
最近,大多数手写的数学表达识别(HMER)方法采用编码器 - 编码器网络,该网络直接从具有注意机制的公式图像中直接预测标记序列。但是,此类方法可能无法准确读取具有复杂结构的公式或生成长的标记序列,因为由于写作样式或空间布局的差异很大,注意结果通常是不准确的。为了减轻此问题,我们为HMER提出了一个名为Counting-Aware-Aware网络(CAN)的非常规网络,该网络共同优化了两个任务:HMER和符号计数。具体而言,我们设计了一个弱监督的计数模块,该模块可以预测每个符号类的数量,而无需符号级别的位置注释,然后将其插入HMER的典型基于注意力的编码器模型。在基准数据集上进行的实验验证了关节优化和计数结果既有益于纠正编码器模型的预测误差,又可以始终如一地胜过最先进的方法。特别是,与HMER的编码器模型相比,提议的计数模块引起的额外时间成本是边缘的。源代码可从https://github.com/lbh1024/can获得。
translated by 谷歌翻译
传统的生物和制药工厂由人类工人或预定义阈值控制。现代化的工厂具有高级过程控制算法,例如模型预测控制(MPC)。但是,几乎没有探索将深入的增强学习来控制制造厂。原因之一是缺乏高保真模拟和基准测试的标准API。为了弥合这一差距,我们开发了一个易于使用的库,其中包括五个高保真模拟环境:BeerfMtenV,Reactorenv,Atropineenv,Pensimenv和Mabenv,涵盖了广泛的制造过程。我们在已发布的动态模型上构建这些环境。此外,我们在线和离线基准基准,基于模型和无模型的强化学习算法,用于比较后续研究。
translated by 谷歌翻译
对象异常的检测对于工业过程至关重要,但是由于难以获得大量有缺陷的样本以及现实生活中无法预测的异常类型,因此无监督的异常检测和定位尤为重要。在现有的无监督异常检测和定位方法中,基于NF的方案取得了更好的结果。但是,两个子网(复杂函数)$ s_ {i}(u_ {i})$和$ t_ {i}(u_ {i})在nf中通常是多层的perceptrons,需要从2D扁平至1D,破坏了特征图中的空间位置关系并丢失空间结构信息。为了保留并有效提取空间结构信息,我们在这项研究中设计了一个复杂的函数模型,该模型具有交替的CBAM嵌入在堆叠的$ 3 \ times3 $全卷积中,该卷积能够保留并有效地在标准化流程模型中提取空间结构信息。 MVTEC AD数据集的广泛实验结果表明,Cainnflow基于CNN和Transformer Backbone网络作为特征提取器达到高级准确性和推理效率,并且Cainnflow可在MVTEC广告中获得$ 98.64 \%的像素级AUC $ 98.64 \%\%。
translated by 谷歌翻译
实体联系面临着重大的挑战,例如多产的变化和普遍的歧义,特别是在具有无数实体的高价值领域。标准分类方法遭受注释瓶颈,无法有效处理看不见的实体。零拍实体链接已成为概括的方向,以概括新实体,但它仍然需要在所有实体的培训和规范描述期间提到示例,这两者都很少在维基百科外面可用。在本文中,我们通过利用易于提供的域知识来探索实体链接的知识丰富的自我监督($ \ tt kriss $)。在培训中,它会使用域本体进行未标记的文本生成自我监控的提到示例,并使用对比学习列举一个上下文编码器。出于推理,它将自我监督的提到作为每个实体的原型,并通过将测试提及映射到最相似的原型来进行链接。我们的方法归入零拍摄和少量拍摄方法,并且可以轻松地包含实体说明和黄金如果可用的标签。使用Biomedicine作为案例研究,我们对跨越生物医学文献和临床票据的七个标准数据集进行了广泛的实验。不使用任何标记信息,我们的方法为400万UMLS实体提供$ \ TT Krissbert $,这是一个Uncer Intity Linker,它可以获得新的艺术状态,优先于先前的自我监督方法,高度为20多个绝对点。
translated by 谷歌翻译
Transformers do not scale very well to long sequence lengths largely because of quadratic self-attention complexity. In the recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem, more often than not claiming superior or comparable model quality to vanilla Transformer models. To this date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide spectrum of tasks and datasets makes it difficult to assess relative model quality amongst many models. This paper proposes a systematic and unified benchmark, Long-Range Arena, specifically focused on evaluating model quality under long-context scenarios. Our benchmark is a suite of tasks consisting of sequences ranging from 1K to 16K tokens, encompassing a wide range of data types and modalities such as text, natural, synthetic images, and mathematical expressions requiring similarity, structural, and visual-spatial reasoning. We systematically evaluate ten well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers, and Longformers) on our newly proposed benchmark suite. Long-Range Arena paves the way towards better understanding this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle. Our benchmark code will be released at https://github.com/google-research/long-range-arena.
translated by 谷歌翻译
The task of referring video object segmentation aims to segment the object in the frames of a given video to which the referring expressions refer. Previous methods adopt multi-stage approach and design complex pipelines to obtain promising results. Recently, the end-to-end method based on Transformer has proved its superiority. In this work, we draw on the advantages of the above methods to provide a simple and effective pipeline for RVOS. Firstly, We improve the state-of-the-art one-stage method ReferFormer to obtain mask sequences that are strongly correlated with language descriptions. Secondly, based on a reliable and high-quality keyframe, we leverage the superior performance of video object segmentation model to further enhance the quality and temporal consistency of the mask results. Our single model reaches 70.3 J &F on the Referring Youtube-VOS validation set and 63.0 on the test set. After ensemble, we achieve 64.1 on the final leaderboard, ranking 1st place on CVPR2022 Referring Youtube-VOS challenge. Code will be available at https://github.com/Zhiweihhh/cvpr2022-rvos-challenge.git.
translated by 谷歌翻译
Referring image segmentation aims to segment the target object described by a given natural language expression. Typically, referring expressions contain complex relationships between the target and its surrounding objects. The main challenge of this task is to understand the visual and linguistic content simultaneously and to find the referred object accurately among all instances in the image. Currently, the most effective way to solve the above problem is to obtain aligned multi-modal features by computing the correlation between visual and linguistic feature modalities under the supervision of the ground-truth mask. However, existing paradigms have difficulty in thoroughly understanding visual and linguistic content due to the inability to perceive information directly about surrounding objects that refer to the target. This prevents them from learning aligned multi-modal features, which leads to inaccurate segmentation. To address this issue, we present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features by guiding the interaction between vision and language through prior position information. Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment by comparing the features of the referred object with those of related objects. Extensive experiments on three benchmarks demonstrate our PCAN performs favorably against the state-of-the-art methods. Our code will be made publicly available.
translated by 谷歌翻译